On the island country of Kronos, increasingly noxious effects on health and farming have been linked to the uncontrolled activity of GAStech, a natural gas operator, supported by corrupt government officials. On January 20th, 2014, a corporate meeting is held to celebrate the new-found fortune from the company's initial public offering. However, a series of rare events occurs, leading to the disappearance of several employees. The Protectors of Kronos (POK), a social movement organization that has been fighting water contamination and government corruption, is suspected in the disappearances.
As analysts, we were assigned several tasks to identify risks and determine how they could have been mitigated more effectively.
In the literature review, I referred to Text Mining with R for processing the text content; from this book I learned how to tokenize text and visualize it, for example as word clouds.
I also referred to the work of Peking University for the visualizations in questions 2 and 3; from that article I learned how to visualize the time series as bar graphs and put the place information on a map.
Using data and visual analytics, we evaluate the changing levels of risk to the public and recommend actions for first responders:
Distinguish meaningful event reports from typical chatter and from junk or spam.
Evaluate how the level of risk to the public evolves over the course of the evening, considering the potential consequences of the situation and the number of people who could be affected, and determine the appropriate location for first responders.
Identify the differences between dealing with this challenge in 2014 and dealing with it now.
Using visual analytics, characterize the different types of content in the dataset. What distinguishes meaningful event reports from typical chatter from junk or spam? Please limit your answer to 8 images and 500 words.
Use visual analytics to represent and evaluate how the level of the risk to the public evolves over the course of the evening. Consider the potential consequences of the situation and the number of people who could be affected. Please limit your answer to 10 images and 1000 words.
If you were able to send a team of first responders to any single place, where would it be? Provide your rationale. How might your response be different if you had to respond to the events in real time rather than retrospectively? Please limit your answer to 8 images and 500 words.
If you solved this mini-challenge in 2014, how did you approach it differently this year?
First, we run the first line of code to clear the environment and remove existing R objects (if any).
This code chunk checks whether the required packages are installed. If they are not, the next line of code installs them, and library() then loads each package into the current working environment.
packages = c('readr','tidytext','data.table','lubridate','ggplot2',
'caret','dplyr','tidyr','scales','quanteda','textdata',
'stringr','stringi','reshape2','RColorBrewer','wordcloud',
'forcats','igraph','ggraph','widyr','clock','knitr','tidyverse',
'DT','hms','ggiraph','topicmodels','raster','sf','maptools',
'rgdal','ggmap','sp','tmap','tmaptools','devtools','patchwork')
for(p in packages){
if(!require(p,character.only = TRUE)){
install.packages(p)
}
library(p,character.only = TRUE)
}
First, use read_csv() to import the three csv files.
read1 <- read_csv("F:/visual/assignment and project/MC3/MC3/csv-1700-1830.csv",
col_types = list(col_character(),col_character(),col_character(),
col_character(),col_double(),col_double(),
col_character()))
read2 <- read_csv("F:/visual/assignment and project/MC3/MC3/csv-1831-2000.csv",
col_types = list(col_character(),col_character(),col_character(),
col_character(),col_double(),col_double(),
col_character()))
read3 <- read_csv("F:/visual/assignment and project/MC3/MC3/csv-2001-2131.csv",
col_types = list(col_character(),col_character(),col_character(),
col_character(),col_double(),col_double(),
col_character()))
Use rbind() to combine the three csv files, which share the same format.
df <- rbind.data.frame(read1,read2,read3)
glimpse(df)
Rows: 4,063
Columns: 7
$ type <chr> "mbdata", "mbdata", "mbdata", "mbdata~
$ `date(yyyyMMddHHmmss)` <chr> "20140123170000", "20140123170000", "~
$ author <chr> "POK", "maha_Homeland", "Viktor-E", "~
$ message <chr> "Follow us @POK-Kronos", "Don't miss ~
$ latitude <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ longitude <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ location <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "~
The table above shows that date(yyyyMMddHHmmss) is a character column, so we convert it to a date-time field. Because all activities in mini-challenge 3 occur on the same day, we also extract the time (hms) component without the date.
df$`date(yyyyMMddHHmmss)` <- date_time_parse(df$`date(yyyyMMddHHmmss)`,
zone = "",
format = "%Y%m%d %H%M%S")
df$time <- as_hms(ymd_hms((df$`date(yyyyMMddHHmmss)`)))
glimpse(df)
Rows: 4,063
Columns: 8
$ type <chr> "mbdata", "mbdata", "mbdata", "mbdata~
$ `date(yyyyMMddHHmmss)` <dttm> 2014-01-23 17:00:00, 2014-01-23 17:0~
$ author <chr> "POK", "maha_Homeland", "Viktor-E", "~
$ message <chr> "Follow us @POK-Kronos", "Don't miss ~
$ latitude <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ longitude <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ location <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "~
$ time <time> 17:00:00, 17:00:00, 17:00:00, 17:00:~
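As the glimpse shows, the date column is already parsed as POSIXct at this point, so the time-of-day could also be taken from it directly. A minimal equivalent sketch (not the original code):

```r
# Equivalent sketch: hms::as_hms() extracts the time-of-day directly
# from a POSIXct column, without re-parsing it via ymd_hms().
library(hms)
df$time <- as_hms(df$`date(yyyyMMddHHmmss)`)
```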
For question 1, we extract the required columns from df into a new data frame. After reading all the data carefully, the language used in mbdata and ccdata is very different, so we separate the two types. Since ccdata consists of police and fire department records, this subset is labeled as meaningful.
df1 <- subset(df, select = c("type","author","message"))
df_m <- subset(df1, type == "mbdata")
df_cc <- subset(df1,type == "ccdata")
df_cc$condition <- "meaningful"
DT::datatable(df_cc,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = FALSE, columnDefs = list(list(width = '400px',targets = c(3))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
Junk definition: after reading all the data carefully, I selected the following types of messages.
Authors like “KronosQuoth” post content related to the three events in this challenge (the rally, the fire and collapse, and the accident and gunshots) but rarely provide important information such as location; the content is mostly emotional venting.
Content that includes the term “#Grammar” is mostly unrelated to the important events in mini-challenge 3.
Content containing “RT” reposts other people’s messages, providing the same information repeatedly, and the number of “RT” messages is very large.
# Collapse the author names into one regex; a multi-line string literal
# would embed newlines and indentation inside the pattern and fail to match.
junk <- df_m %>%
filter(str_detect(author,
paste(c("KronosQuoth","Clevvah4Eva","choconibbs","trollingsnark",
"blueSunshine","whiteprotein","FriendsOfKronos","junkman377",
"junkman995","redisrad","roger_roger","cheapgoods998","rockinHW",
"panopticon","dels4realz","eazymoney","cleaningFish"),
collapse = "|")) |
str_detect(message,"#Grammar|RT"))
Meaningful definition: after reading all the data carefully, I selected the following types of messages.
Information released by official or authoritative authors, like “POK” and “AbilaPoliceDepartment”.
Information released by authors who witnessed the events and publish important details, like “magaMan” and “Sara_Nespola”.
Content that includes terms like “fire” and “rally” is mostly related to the important events in mini-challenge 3.
meaningful <- df_m %>%
# Collapse the author names into one regex so no newline or indentation
# ends up inside the pattern.
filter(str_detect(author,
paste(c("POK","AbilaPost","CentralBulletin","ourcountryyourrights",
"MindOfKronos","Viktor-E","maha_Homeland","anaregents","wordWatcher",
"InternationalNews","HomelandIlluminations","NewsOnlineToday",
"AbilaPoliceDepartment","KronosStar","magaMan","Sara_Nespola",
"protoGuy","SiaradSea","AbilaFire","footfingers","truthforcadau",
"truccotrucco","dangermice","trapanitweets","sofitees",
"brewvebeenserved","hennyhenhendrix"),
collapse = "|")) |
str_detect(message,[2191 chars quoted with '"']))
Meaningless definition: after reading all the data carefully, I selected the following types of messages.
Content that includes terms like “fire” and “rally” but only expresses emotion and provides no important information.
Content that is unrelated to the events and serves the authors’ own objectives, such as sharing shop information.
This group is obtained by subtracting the other groups from df_m with the anti_join() function.
meaningful <- dplyr::anti_join(meaningful,junk,by = c("type", "author", "message"))
combinedata <- rbind.data.frame(meaningful, junk)
meaningless <- dplyr::anti_join(df_m,combinedata,by = c("type", "author", "message"))
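A quick sanity check, not part of the original workflow, can confirm that the three groups partition df_m with no message counted twice (the anti_join() above already made junk and meaningful disjoint):

```r
# Sketch: the three labelled groups should add back up to df_m exactly.
stopifnot(nrow(junk) + nrow(meaningful) + nrow(meaningless) == nrow(df_m))
```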
Combine the meaningful, meaningless and junk data, and add a new label column.
junk$condition <- "junk"
meaningful$condition <- "meaningful"
meaningless$condition <- "meaningless"
finalq1 <- rbind.data.frame(meaningful,junk, meaningless)
DT::datatable(finalq1,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = FALSE, columnDefs = list(list(width = '400px', targets = c(3))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
Use the stringr package to remove punctuation, @, #, < and Chinese characters from the messages. The messages in ccdata are already very clean and need no special handling.
finalq1$message <- str_replace_all(finalq1$message,'[[:punct:]]+', "")
finalq1$message <- str_replace_all(finalq1$message,fixed("@"),"")
finalq1$message <- str_replace_all(finalq1$message,fixed("#"),"")
finalq1$message <- str_replace_all(finalq1$message,fixed("<"),"")
finalq1$message <- str_replace_all(finalq1$message,"[\u4e00-\u9fa5]+", "")
The messages in the table below are now clean.
Exclude stop words from the text, and use tibble() to build a custom stop-word list selected according to the content of the text.
tidy_m <- finalq1 %>%
unnest_tokens(word, message) %>%
count(condition,word,sort = TRUE)
data(stop_words)
tidy_m <- tidy_m %>%
anti_join(stop_words)
my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
"yay","ya","xx3942","wuz","wow",
"dr"))
tidy_m <- tidy_m %>%
anti_join(my_stopwords)
tidy_cc <- df_cc %>%
unnest_tokens(word,message) %>%
count(word, sort = TRUE)
DT::datatable(tidy_m,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = FALSE, columnDefs = list(list(width = '60px', targets = c(0:3))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
The graph below shows the word-count distribution of the junk, meaningful and meaningless groups. Junk messages have a high repetition rate of words.
ggplot(tidy_m,aes(n,fill = condition))+
geom_histogram(show.legend = FALSE)+
scale_fill_brewer(palette = "Pastel1")+
xlim(0,100)+
facet_wrap(~condition, ncol = 2,scales = "free_y")

The graphs below show the top 15 words (by count n) of the junk, meaningful and meaningless groups.
Junk group
Most are words related to the POK rally, used to vent feelings, or words related to success, like “success” and “life”. There are also statements against the government and police.
There are also obvious spam markers such as “rt” and “grammar”. These messages are very numerous and provide no useful information.
Meaningful group
Words that identify the important people of the events, like “viktore”.
Words that describe the important processes of the events, like “standoff”.
Words that identify the important locations of the events, like “dolphin” and “dancing”.
Meaningless group
Content sent by individuals for their own purposes.
Content sent by businesses to promote their company or products; a word like “credit” comes from a credit-card promotion.
tidy_m %>%
group_by(condition) %>%
slice_max(n, n= 15) %>%
ungroup() %>%
mutate(word = reorder_within(word,n,condition)) %>%
ggplot(aes(x = n,
y= word,
fill =condition))+
geom_col(show.legend = FALSE)+
scale_fill_brewer(palette = "Pastel1")+
scale_y_reordered()+
facet_wrap(~ condition, ncol = 2,scales = "free")+
ggtitle("mbdata") +
theme(plot.title = element_text(size=10,
hjust = 0.4))+
labs(y = NULL)

The graph of ccdata shows events like “fire” and “traffic”. A word like “vehicle” relates to the event “From hit-and-run accident to shooting and standoff”.
tidy_cc %>%
slice_max(n,n = 15) %>%
ggplot(aes(x = n,
y= reorder(word,n)))+
geom_col(show.legend = FALSE, fill = "darkgoldenrod1")+
ggtitle("ccdata_meaningful") +
theme(plot.title = element_text(size=10,
hjust = 0.45))+
labs(y = NULL)

The word clouds below lead to conclusions similar to the simple EDA above.
wordcloud_m <- finalq1 %>%
filter(condition == "meaningful") %>%
unnest_tokens(word, message)%>%
anti_join(stop_words) %>%
anti_join(my_stopwords) %>%
count(word,sort = TRUE)%>%
with(wordcloud(word,n,max.words = 100))
wordcloud_m <- finalq1 %>%
filter(condition == "meaningless") %>%
unnest_tokens(word, message)%>%
anti_join(stop_words) %>%
anti_join(my_stopwords) %>%
count(word,sort = TRUE)%>%
with(wordcloud(word,n,max.words = 100))
wordcloud_m <- finalq1 %>%
filter(condition == "junk") %>%
unnest_tokens(word, message)%>%
count(word,sort = TRUE)%>%
anti_join(stop_words) %>%
anti_join(my_stopwords) %>%
with(wordcloud(word,n,max.words = 100))
tidy_cc %>%
with(wordcloud(word,n,max.words = 100))

Use bind_tf_idf() to find the words that are distinctive for each category.
Junk: the graph shows that retweets (like “rt”), social media accounts (like “kronosstar”) and emotive words (like “success”) are the important components of junk messages.
Meaningful: the graph shows that government departments and emergency services (like “apd” and “afd”) and important people at the events (like “sylvia”) are the important components of meaningful messages.
Meaningless: the graph shows that company slogans (like “nobanks”) are the important components of meaningless messages.
m_tf_idf <- tidy_m %>%
bind_tf_idf(word,condition,n)
m_tf_idf %>%
arrange(desc(tf_idf))
# A tibble: 4,111 x 6
condition word n tf idf tf_idf
<chr> <chr> <int> <dbl> <dbl> <dbl>
1 junk rt 1000 0.0558 1.10 0.0613
2 junk kronosstar 884 0.0493 0.405 0.0200
3 junk homelandilluminations 183 0.0102 1.10 0.0112
4 junk grammar 157 0.00876 1.10 0.00962
5 junk abilapost 330 0.0184 0.405 0.00746
6 junk rally 260 0.0145 0.405 0.00588
7 meaningless cards 9 0.00521 1.10 0.00572
8 meaningless easycreditkronosmorecredit 9 0.00521 1.10 0.00572
9 meaningless nobanks 9 0.00521 1.10 0.00572
10 meaningful abilapost 70 0.0137 0.405 0.00555
# ... with 4,101 more rows
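For reference, bind_tf_idf() here treats each condition as one document: tf is a word's count divided by the total word count of its condition, and idf is the natural log of the number of conditions divided by the number of conditions containing the word. A sketch of the same computation by hand (not the original code):

```r
library(dplyr)

# Manual tf-idf, treating each condition (junk/meaningful/meaningless) as a document.
n_docs <- n_distinct(tidy_m$condition)  # 3 documents
manual_tf_idf <- tidy_m %>%
  group_by(condition) %>%
  mutate(tf = n / sum(n)) %>%                            # term frequency within the condition
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(condition))) %>%  # inverse document frequency
  ungroup() %>%
  mutate(tf_idf = tf * idf)
# e.g. "rt" occurs only in junk, so idf = log(3/1) ≈ 1.10, matching the table above.
```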
m_tf_idf %>%
group_by(condition) %>%
slice_max(tf_idf, n = 15) %>%
ungroup() %>%
mutate(word = reorder_within(word,tf_idf,condition)) %>%
ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = condition))+
scale_fill_brewer(palette = "Pastel1")+
scale_y_reordered()+
geom_col(show.legend = FALSE)+
facet_wrap(~condition,ncol = 2,scales = "free")+
ggtitle("mbdata") +
theme(plot.title = element_text(size=10,
hjust = 0.4))+
labs(y = NULL)

Use bigrams to find the important phrases for the different categories.
There is actually no big difference among the three categories, but meaningful messages contain more specific time and place phrases.
meaningful_bigrams <- meaningful %>%
unnest_tokens(bigram,message,token = "ngrams", n = 2)
meaningful_bigrams %>%
count(bigram, sort = TRUE)
# A tibble: 5,775 x 2
bigram n
<chr> <int>
1 viktor e 48
2 of the 42
3 dancing dolphin 41
4 in the 40
5 at the 38
6 abila centralbulletin 30
7 pok rally 28
8 to the 24
9 dr newman 23
10 dolphin fire 20
# ... with 5,765 more rows
meaningful_separated <- meaningful_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
meaningful_filtered <- meaningful_separated %>%
filter(!word1 %in% my_stopwords$word) %>%
filter(!word2 %in% my_stopwords$word)
meaningful_filtered <- meaningful_filtered %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
meaningful_counts <- meaningful_filtered %>%
count(word1, word2, sort = TRUE)
meaningful_graph <- meaningful_counts %>%
filter(n > 4) %>%
graph_from_data_frame()
set.seed(2020)
a <- grid::arrow(type = "closed",length = unit(.15,"inches"))
ggraph(meaningful_graph,layout = "fr")+
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
junk_bigrams <- junk %>%
unnest_tokens(bigram,message,token = "ngrams", n = 2)
junk_bigrams %>%
count(bigram, sort = TRUE)
# A tibble: 7,338 x 2
bigram n
<chr> <int>
1 pokrally hi 670
2 kronosstar pokrally 598
3 pok rally 234
4 rt homelandilluminations 183
5 rt abilapost 169
6 rally grammar 157
7 rt kronosstar 143
8 if you 126
9 of the 115
10 you can 102
# ... with 7,328 more rows
junk_separated <- junk_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
junk_filtered <- junk_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
junk_counts <- junk_filtered %>%
count(word1, word2, sort = TRUE)
junk_graph <- junk_counts %>%
filter(n > 50) %>%
graph_from_data_frame()
set.seed(2020)
a <- grid::arrow(type = "closed",length = unit(.15,"inches"))
ggraph(junk_graph,layout = "fr")+
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
meaningless_bigrams <- meaningless %>%
unnest_tokens(bigram,message,token = "ngrams", n = 2)
meaningless_bigrams %>%
count(bigram, sort = TRUE)
# A tibble: 2,051 x 2
bigram n
<chr> <int>
1 badprofiles.kronos tacky 12
2 of the 12
3 viktor e 10
4 abila nobanks 9
5 abila pictures 9
6 cards get 9
7 credit cards 9
8 easy credit 9
9 easycredit.kronos morecredit 9
10 get what 9
# ... with 2,041 more rows
meaningless_separated <- meaningless_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
meaningless_filtered <- meaningless_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
meaningless_counts <- meaningless_filtered %>%
count(word1, word2, sort = TRUE)
meaningless_graph <- meaningless_counts %>%
filter(n > 4) %>%
graph_from_data_frame()
set.seed(2020)
a <- grid::arrow(type = "closed",length = unit(.15,"inches"))
ggraph(meaningless_graph,layout = "fr")+
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()

We can distinguish the categories by the characteristics of the words used in each.
Junk: the characteristics of the words in the junk category:
Most are words related to the POK rally, used to vent feelings, or words related to success, like “success” and “life”. There are also statements against the government and police.
There are also obvious spam markers such as “rt” and “grammar”. These messages are very numerous and provide no useful information.
There are many social media account names (like “kronosstar”).
Meaningful: the characteristics of the words in the meaningful category:
Words that identify the important people of the events, like “viktore”.
Words that describe the important processes of the events and the important government or emergency services, like “standoff”, “afd” and “apd”.
Words that identify the important locations of the events, like “dolphin” and “dancing”.
Meaningless: the characteristics of the words in the meaningless category:
Content sent by individuals for their own purposes.
Content sent by businesses to promote their company or products; a word like “credit” comes from a credit-card promotion.
Repeat the process above to obtain a complete meaningful dataset, this time keeping the time and location columns.
q2_m <- subset(df, type == "mbdata")
q2_cc <- subset(df,type == "ccdata")
# Collapse the author names into one regex; a multi-line string literal
# would embed newlines and indentation inside the pattern and fail to match.
q2_junk <- q2_m %>%
filter(str_detect(author,
paste(c("KronosQuoth","Clevvah4Eva","choconibbs","trollingsnark",
"blueSunshine","whiteprotein","FriendsOfKronos","junkman377",
"junkman995","redisrad","roger_roger","cheapgoods998","rockinHW",
"panopticon","dels4realz","eazymoney","cleaningFish"),
collapse = "|")) |
str_detect(message,"#Grammar|RT"))
q2_meaningful <- q2_m %>%
filter(str_detect(author,
paste(c("POK","AbilaPost","CentralBulletin","ourcountryyourrights",
"MindOfKronos","Viktor-E","maha_Homeland","anaregents","wordWatcher",
"InternationalNews","HomelandIlluminations","NewsOnlineToday",
"AbilaPoliceDepartment","KronosStar","magaMan","Sara_Nespola",
"protoGuy","SiaradSea","AbilaFire","footfingers","truthforcadau",
"truccotrucco","dangermice","trapanitweets","sofitees",
"brewvebeenserved","hennyhenhendrix"),
collapse = "|")) |
str_detect(message,[2191 chars quoted with '"']))
q2_meaningful <- dplyr::anti_join(q2_meaningful,q2_junk)
q2_lda <- subset(q2_meaningful,select = c("type","date(yyyyMMddHHmmss)","author",
"message","latitude","longitude","time"))
q2_lda <- na.omit(q2_lda)
tidy_q2 <- q2_lda %>%
unnest_tokens(word,message)
q2_wordcount <- tidy_q2 %>%
anti_join(stop_words)
my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
"yay","ya","xx3942","wuz","wow",
"dr"))
q2_wordcount <- q2_wordcount %>%
anti_join(my_stopwords) %>%
count(author,word,sort = TRUE)
q2_wordcount
# A tibble: 453 x 3
author word n
<chr> <chr> <int>
1 footfingers pok 21
2 truccotrucco standoff 21
3 truccotrucco im 12
4 truccotrucco shooting 12
5 truccotrucco gelatogalore 9
6 truccotrucco van 9
7 truthforcadau viktor 8
8 footfingers kronos 7
9 dangermice abilafire 6
10 footfingers people 6
# ... with 443 more rows
q2_dtm <- q2_wordcount %>%
cast_dfm(author,word,n)
q2_author_lda <- LDA(q2_dtm,k = 3, control = list(seed = 1234))
q2_topics <- tidy(q2_author_lda,matrix = "beta")
q2_topics
# A tibble: 1,092 x 3
topic term beta
<int> <chr> <dbl>
1 1 pok 1.37e- 1
2 2 pok 2.64e-82
3 3 pok 1.84e- 2
4 1 standoff 1.25e-81
5 2 standoff 2.41e- 4
6 3 standoff 7.63e- 2
7 1 im 3.95e-82
8 2 im 5.06e- 5
9 3 im 4.28e- 2
10 1 shooting 1.24e-87
# ... with 1,082 more rows
q2_topics %>%
group_by(topic) %>%
top_n(10,beta) %>%
ungroup() %>%
mutate(term = reorder_within(term,beta,topic)) %>%
ggplot(aes(beta,term,fill = topic))+
scale_y_reordered()+
geom_col(show.legend = FALSE)+
facet_wrap(~topic,ncol = 2,scales = "free")+
ggtitle("meaningful") +
theme(plot.title = element_text(size=10,
hjust = 0.4))+
labs(y = NULL)
q2_author_lda2 <- LDA(q2_dtm,k = 4, control = list(seed = 1234))
q2_topics2 <- tidy(q2_author_lda2,matrix = "beta")
q2_topics2
# A tibble: 1,456 x 3
topic term beta
<int> <chr> <dbl>
1 1 pok 1.37e- 1
2 2 pok 6.08e-121
3 3 pok 1.92e- 2
4 4 pok 3.27e-115
5 1 standoff 6.36e-121
6 2 standoff 2.68e- 2
7 3 standoff 6.71e- 2
8 4 standoff 2.70e-104
9 1 im 3.87e-121
10 2 im 1.34e- 2
# ... with 1,446 more rows
q2_topics2 %>%
group_by(topic) %>%
top_n(10,beta) %>%
ungroup() %>%
mutate(term = reorder_within(term,beta,topic)) %>%
ggplot(aes(beta,term,fill = topic))+
scale_y_reordered()+
geom_col(show.legend = FALSE)+
facet_wrap(~topic,ncol = 2,scales = "free")+
ggtitle("meaningful") +
theme(plot.title = element_text(size=10,
hjust = 0.4))+
labs(y = NULL)
q2_author_lda3 <- LDA(q2_dtm,k = 5, control = list(seed = 1234))
q2_topics3 <- tidy(q2_author_lda3,matrix = "beta")
q2_topics3
# A tibble: 1,820 x 3
topic term beta
<int> <chr> <dbl>
1 1 pok 1.75e- 1
2 2 pok 1.00e-159
3 3 pok 1.92e- 2
4 4 pok 3.64e-154
5 5 pok 7.14e- 2
6 1 standoff 1.46e-157
7 2 standoff 2.68e- 2
8 3 standoff 6.71e- 2
9 4 standoff 1.03e-129
10 5 standoff 4.84e-154
# ... with 1,810 more rows
q2_topics3 %>%
group_by(topic) %>%
top_n(5,beta) %>%
ungroup() %>%
mutate(term = reorder_within(term,beta,topic)) %>%
ggplot(aes(beta,term,fill = topic))+
scale_y_reordered()+
geom_col(show.legend = FALSE)+
facet_wrap(~topic,ncol = 2,scales = "free")+
ggtitle("meaningful") +
theme(plot.title = element_text(size=10,
hjust = 0.4))+
labs(y = NULL)
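Instead of eyeballing k = 3, 4 and 5, the number of topics could also be compared quantitatively. The sketch below (an assumption, not part of the original analysis) uses topicmodels::perplexity(), where a lower value indicates a better fit:

```r
library(topicmodels)

# Sketch: refit the LDA model for each candidate k and compare perplexity.
ks <- c(3, 4, 5)
perp <- sapply(ks, function(k) {
  lda_k <- LDA(q2_dtm, k = k, control = list(seed = 1234))
  perplexity(lda_k)
})
names(perp) <- ks
perp  # lower perplexity = better fit, to be balanced against interpretability
```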

Use the important words of the POK rally event to select the rally-relevant messages from the meaningful dataset.
q2_rally_m <- q2_meaningful %>%
filter(str_detect(message,"pokrally|Abila City Park|Stand Up Speak Up|Sylvia Marek|Audrey McConnell Newman, Professor Lorenzo Di Stefano|Lucio Jakab|Viktor-E|Sylvia|Marek|Newman|Stefano|Di Stefano|Lucio|Jakab"))
q2_rally_cc <- q2_cc %>%
filter(str_detect(message,"ABILA CITY PARK|CROWD"))
q2_rally <- rbind(q2_rally_m,q2_rally_cc)
DT::datatable(q2_rally,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = FALSE, columnDefs = list(list(width = '400px', targets = c(4))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
Clean the message content.
q2_rally$message <- str_replace_all(q2_rally$message,'[[:punct:]]+', "")
q2_rally$message <- str_replace_all(q2_rally$message,fixed("@"),"")
q2_rally$message <- str_replace_all(q2_rally$message,fixed("#"),"")
q2_rally$message <- str_replace_all(q2_rally$message,fixed("<"),"")
q2_rally$message <- str_replace_all(q2_rally$message,"[\u4e00-\u9fa5]+", "")
Tokenize the dataset.
q2_rally_tidy <- q2_rally %>%
unnest_tokens(word, message)
data(stop_words)
q2_rally_tidy <- q2_rally_tidy %>%
anti_join(stop_words)
my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
"yay","ya","xx3942","wuz","wow",
"dr"))
q2_rally_tidy <- q2_rally_tidy %>%
anti_join(my_stopwords)
DT::datatable(q2_rally_tidy,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = FALSE, columnDefs = list(list(width = '60px', targets = c(0:3))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
Use the important words of the “Fire in Dancing Dolphin” event to select the fire-relevant messages from the meaningful dataset.
q2_fire_m <- q2_meaningful %>%
filter(str_detect(message,"fire|dolphin|dancing|building|apartment|Madeg|dispatch|afd|floor|floors|fireman|firefighters|firefighter|evacuate|evacuated|evacuating|evacuation|trapped|injuries|scene|collapsed|blaze|escalated"))
q2_fire_cc <- q2_cc %>%
filter(str_detect(message,"Fire|Crime|scene"))
q2_fire <- rbind(q2_fire_m,q2_fire_cc)
DT::datatable(q2_fire,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = FALSE, columnDefs = list(list(width = '400px', targets = c(4))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
Clean the message content.
q2_fire$message <- str_replace_all(q2_fire$message,'[[:punct:]]+', "")
q2_fire$message <- str_replace_all(q2_fire$message,fixed("@"),"")
q2_fire$message <- str_replace_all(q2_fire$message,fixed("#"),"")
q2_fire$message <- str_replace_all(q2_fire$message,fixed("<"),"")
q2_fire$message <- str_replace_all(q2_fire$message,"[\u4e00-\u9fa5]+", "")
Tokenize the dataset.
q2_fire_tidy <- q2_fire %>%
unnest_tokens(word, message)
data(stop_words)
q2_fire_tidy <- q2_fire_tidy %>%
anti_join(stop_words)
my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
"yay","ya","xx3942","wuz","wow",
"dr"))
q2_fire_tidy <- q2_fire_tidy %>%
anti_join(my_stopwords)
DT::datatable(q2_fire_tidy,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = FALSE, columnDefs = list(list(width = '60px', targets = c(0:3))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
Use the important words of the “From hit-and-run accident to shooting and standoff” event to select the accident-relevant messages from the meaningful dataset.
q2_accident_m <- q2_meaningful %>%
filter(str_detect(message,"shooting|standoff|hostage|swat|negotiation|fight|arrest|hit|van|driver|bicyclist|accident|incident|bike|L829|pursuit|gun|shot|kill|dead|yelling|screaming|negotiating|negotiator|caught|over|end|shoot|chasing"))
q2_accident_cc <- q2_cc %>%
filter(str_detect(message,"van|pursuit|accident|vandalism|swat"))
q2_accident <- rbind(q2_accident_m,q2_accident_cc)
DT::datatable(q2_accident,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = FALSE, columnDefs = list(list(width = '400px', targets = c(4))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
Clean the message content.
q2_accident$message <- str_replace_all(q2_accident$message,'[[:punct:]]+', "")
q2_accident$message <- str_replace_all(q2_accident$message,fixed("@"),"")
q2_accident$message <- str_replace_all(q2_accident$message,fixed("#"),"")
q2_accident$message <- str_replace_all(q2_accident$message,fixed("<"),"")
q2_accident$message <- str_replace_all(q2_accident$message,"[\u4e00-\u9fa5]+", "")
Tokenize the dataset.
q2_accident_tidy <- q2_accident %>%
unnest_tokens(word, message)
data(stop_words)
q2_accident_tidy <- q2_accident_tidy %>%
anti_join(stop_words)
my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
"yay","ya","xx3942","wuz","wow",
"dr"))
q2_accident_tidy <- q2_accident_tidy %>%
anti_join(my_stopwords)
DT::datatable(q2_accident_tidy,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = FALSE, columnDefs = list(list(width = '60px', targets = c(0:3))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
From the processing of the data, we found that there were three events in mini-challenge 3.
The POK rally, with focus on the start time of the rally, the times of the different speakers, and the end time.
The fire in the Dancing Dolphin, with focus on the start of the fire, the rescue, the evacuation and the final explosion.
The hit-and-run accident that escalated to a shooting and standoff, with focus on the traffic accident, the hostage situation, the confrontation with the police, the negotiations and the rescue of the hostages.
The POK rally was held at Abila City Park. About 2,000 people gathered there, and a heavy police presence was deployed.
q2_rally_tidy%>%
filter(str_detect(word,"rally")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#99CCFF")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y ="rally")
q2_rally_tidy%>%
filter(str_detect(word,"pok")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#99CCFF")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y ="pok")
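The xlim bounds in coord_cartesian() are seconds since midnight, because time is an hms value. The short sketch below shows how the two constants map back to clock times:

```r
library(hms)
as.numeric(as_hms("17:00:00"))  # 61200, the start of the data window (17:00)
as.numeric(as_hms("21:31:00"))  # 77460, the end of the data window (21:31)
```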

q2_rally_tidy%>%
filter(str_detect(word,"sylvia|marek")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#6699CC")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y ="sylvia|marek")

The first speaker was Lucio Jakab, co-founder of Save Our Wildlands. He said that we have a symbiotic relationship with the earth and that the wildlands are our legacy for the children.
After Jakab’s speech, Viktor-E came on stage to play the new song “River Soldiers”.
The third speaker, Professor Lorenzo Di Stefano, gave a talk about corporate social responsibility.
The next talk was delivered by Audrey McConnell Newman, an internationally renowned environmental scientist.
The speeches ended at 18:45. Sylvia Marek thanked Dr. Newman and welcomed Viktor-E back on stage to sing their hit “Stand Up Speak Up”. The rally closed around 19:05.
q2_rally_tidy%>%
filter(str_detect(word,"lucio|jakab")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#6699CC")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y ="lucio|jakab")
q2_rally_tidy%>%
filter(str_detect(word,"viktor")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#6699CC")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y ="viktor")
q2_rally_tidy%>%
filter(str_detect(word,"lorenzo|di|stefano")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#6699CC")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y ="lorenzo|di|stefano")
q2_rally_tidy%>%
filter(str_detect(word,"audrey|mcConnell|newman")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#6699CC")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y ="audrey|mcConnell|newman")

The fire was at the Dancing Dolphin. About 2,000 people gathered there.
At the earliest time (around 18:30), people reported some precursors of the fire.
The call center reported a possible fire at 18:40 and called for fire trucks at around the same time.
Around 18:50, the fire trucks and an ambulance arrived.
The police evacuated residents at around 18:50 and expanded the evacuation area at 19:20.
There are conflicting reports on the condition of the fire from 19:15.
At 21:30, AFD reported an explosion at the Dancing Dolphin.
q2_fire_tidy%>%
filter(str_detect(word,"fire|dolphin|dancing|building|apartment")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#FFCCCC")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs( x= "first phrase", y = NULL)
q2_fire_tidy%>%
filter(str_detect(word,"floor|floors|upper|resident")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#FFCCCC")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(x= "first phrase",y =NULL)
q2_fire_tidy%>%
filter(str_detect(word,"afd|police|cop|cops|fireman|firefighter")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#FFCCCC")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(x= "first phrase",y =NULL)
q2_fire_tidy%>%
filter(str_detect(word,"ambulance|injury|injuries|evacuated|evacuating|evacuation|evacuate")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#FF99CC")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(x= "second phrase",y =NULL)
q2_fire_tidy%>%
filter(str_detect(word,"control")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#CC6699")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(x= "third phrase",y = NULL)
q2_fire_tidy%>%
filter(str_detect(word,"collapsed|blaze|escalated|explosion")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#993366")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(x= "fourth phrase",y =NULL)

The hit-and-run accident was widely reported at around 19:15.
At around 19:35, police pursued the suspect vehicle (a black van).
At around 19:45, a gun battle broke out between the police and the hit-and-run drivers, and the two men in the van took two hostages.
Between 19:45 and 20:00, many people reported that a police officer was killed in the shooting.
From 20:00, the police and the hit-and-run suspects were in a standoff, and the police were evacuating people.
At around 21:15, the two kidnappers dropped their guns and gave up.
q2_accident_tidy%>%
filter(str_detect(word,"hit|run|van|bicyclist|driver|incident|accident|bike|pursuit")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#99CC99")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(x= "first phrase",y = NULL)
q2_accident_tidy%>%
filter(str_detect(word,"gun|shoot|shot|hostage")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#00CC66")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(x= "second phrase",y =NULL)
q2_accident_tidy%>%
filter(str_detect(word,"killed|dead")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#009900")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(x= "third phrase",y =NULL)
q2_accident_tidy%>%
filter(str_detect(word,"standoff|negotiating|negotiate|negotiator|negotiation|yelling|screaming|chasing")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#006600")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(x= "fourth phrase",y =NULL)
q2_accident_tidy%>%
filter(str_detect(word,"end|over|caught|rescued")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#003300")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(x= "fifth phrase",y =NULL)

Use st_as_sf() to transform the coordinates of these three events into simple-feature objects that can be plotted with tmap.
bgmap <- raster("F:/visual/assignment and project/MC3/MC3/Geospatial/MC2-tourist.tif")
bgmap
class : RasterLayer
band : 1 (of 3 bands)
dimensions : 1595, 2706, 4316070 (nrow, ncol, ncell)
resolution : 3.16216e-05, 3.16216e-05 (x, y)
extent : 24.82419, 24.90976, 36.04499, 36.09543 (xmin, xmax, ymin, ymax)
crs : +proj=longlat +datum=WGS84 +no_defs
source : MC2-tourist.tif
names : MC2.tourist
values : 0, 255 (min, max)
abila_st <- st_read(dsn = "F:/visual/assignment and project/MC3/MC3/Geospatial",
layer = "Abila")
Reading layer `Abila' from data source
`F:\visual\assignment and project\MC3\MC3\Geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension: XY
Bounding box: xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS: WGS 84
abila <- read_sf("F:/visual/assignment and project/MC3/MC3/Geospatial/Abila.shp")
q3_gps <- subset(q2_meaningful,select = c("type","date(yyyyMMddHHmmss)","author",
"message","latitude","longitude","time"))
gps_m <- na.omit(q3_gps)
p <- gps_m %>%
count(longitude,latitude)
p$n <- as.numeric(p$n)
gps_sf <- st_as_sf(p,
coords = c("longitude","latitude"),
crs = 4326)
gps_point <- gps_sf %>%
st_cast("MULTIPOINT")
q3_rally_gps <- subset(q2_rally_m,select = c("type","date(yyyyMMddHHmmss)","author",
"message","latitude","longitude","time"))
gps_rally_m <- na.omit(q3_rally_gps)
gps_rally_sf <- st_as_sf(gps_rally_m,
coords = c("longitude","latitude"),
crs = 4326)
q3_rally_gps <- na.omit(q3_rally_gps)
rally_count <- q3_rally_gps %>%
count(longitude,latitude)
rally_count$n <- as.numeric(rally_count$n)
gps_rally_sf <- st_as_sf(rally_count,
coords = c("longitude","latitude"),
crs = 4326)
gps_rally_point <- gps_rally_sf %>%
st_cast("MULTIPOINT")
q3_fire_gps <- subset(q2_fire_m,select = c("type","date(yyyyMMddHHmmss)","author",
"message","latitude","longitude","time"))
gps_fire_m <- na.omit(q3_fire_gps)
gps_fire_sf <- st_as_sf(gps_fire_m,
coords = c("longitude","latitude"),
crs = 4326)
q3_fire_gps <- na.omit(q3_fire_gps)
fire_count <- q3_fire_gps %>%
count(longitude,latitude)
fire_count$n <- as.numeric(fire_count$n)
gps_fire_sf <- st_as_sf(fire_count,
coords = c("longitude","latitude"),
crs = 4326)
gps_fire_point <- gps_fire_sf %>%
st_cast("MULTIPOINT")
q3_accident_gps <- subset(q2_accident_m,select = c("type","date(yyyyMMddHHmmss)","author",
"message","latitude","longitude","time"))
gps_accident_m <- na.omit(q3_accident_gps)
gps_accident_sf <- st_as_sf(gps_accident_m,
coords = c("longitude","latitude"),
crs = 4326)
q3_accident_gps <- na.omit(q3_accident_gps)
accident_count <- q3_accident_gps %>%
count(longitude,latitude)
accident_count$n <- as.numeric(accident_count$n)
accident_count <- na.omit(accident_count)
gps_accident_sf <- st_as_sf(accident_count,
coords = c("longitude","latitude"),
crs = 4326)
gps_accident_point <- gps_accident_sf %>%
st_cast("MULTIPOINT")
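The count-and-convert steps above are repeated for each event; they could be consolidated into a small helper (a sketch; `count_to_points` is a hypothetical function, assuming each data frame has `longitude` and `latitude` columns):

```r
library(dplyr)
library(sf)

# Hypothetical helper wrapping the repeated pattern above:
# count messages per coordinate, drop NAs, convert to sf points.
count_to_points <- function(df) {
  counted <- df %>%
    count(longitude, latitude) %>%
    na.omit()
  counted$n <- as.numeric(counted$n)
  st_as_sf(counted,
           coords = c("longitude", "latitude"),
           crs = 4326) %>%
    st_cast("MULTIPOINT")
}
```

For example, `gps_accident_point <- count_to_points(q3_accident_gps)` would replace the last five accident steps above.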
Use the number of messages reporting each place as the size and color of the dots on the map.
tmap_mode("view")
tm_shape(bgmap,point.per = "feature")+
tm_rgb(r=1,g=2,b=3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_point,is.master = TRUE, point.per = 'feature')+
tm_dots(size ="n",col = "n")
Reported messages about the POK rally are concentrated around Abila Park.
tmap_mode("view")
tm_shape(bgmap)+
tm_rgb(r=1,g=2,b=3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_rally_point) +
tm_dots(size = "n",col = "n")
Reported messages about the fire at the Dancing Dolphin are concentrated around Guy's Gyros.
tmap_mode("view")
tm_shape(bgmap)+
tm_rgb(r=1,g=2,b=3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_fire_point) +
tm_dots(size = "n",col = "n")
Reported messages about the sequence from the hit-and-run accident to the shooting and standoff are concentrated in several different places, because the location of the event changed during the pursuit.
tmap_mode("view")
tm_shape(bgmap)+
tm_rgb(r=1,g=2,b=3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_accident_point) +
tm_dots(size = "n",col = "n")
From the maps above, if we could send a team of first responders to only a single place, I would send them to the site of the accident. If the two suspects had been caught right away, the gun battle would not have happened. The accident site is also very close to the fire, so the team could reach the fire quickly after dealing with the accident.
If I had to respond in real time, I would send the team to the fire, because many residents are involved and the situation could end badly for them.
The place I would not choose is the POK rally. Although it involves a large number of people (about 2,000), no one was hurt at the rally.
The biggest difference for me between 2014 and now is the way of displaying maps. In this mini-challenge I mainly used the tmap package, but it was published in 2016, so in 2014 I could not have used it and might have had to rely on more fundamental packages, as below.